247 research outputs found

Scalable Solutions for Automated Single Pulse Identification and Classification in Radio Astronomy

Author: Bertram Ludäscher
Brian
Junfei Qiu
Matei
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 07/10/2018

Data collection for scientific applications is increasing exponentially and is forecast to soon reach peta- and exabyte scales. Applications that process and analyze scientific data must be scalable and focus on execution performance to keep pace. In the field of radio astronomy, in addition to increasingly large datasets, tasks such as the identification of transient radio signals from extrasolar sources are computationally expensive. We present a scalable approach to radio pulsar detection written in Scala that parallelizes candidate identification to take advantage of in-memory task processing using Apache Spark on a YARN distributed system. Furthermore, we introduce a novel automated multiclass supervised machine learning technique that we combine with feature selection to reduce the time required for candidate classification. Experimental testing on a Beowulf cluster with 15 data nodes shows that the parallel implementation of the identification algorithm offers a speedup of up to 5x over a similar multithreaded implementation. Further, we show that the combination of automated multiclass classification and feature selection speeds up the execution performance of the RandomForest machine learning algorithm by an average of 54% with less than a 2% average reduction in the algorithm's ability to correctly classify pulsars. The generalizability of these results is demonstrated by using two real-world radio astronomy data sets. Comment: In Proceedings of the 47th International Conference on Parallel Processing (ICPP 2018). ACM, New York, NY, USA, Article 11, 11 pages

arXiv.org e-Print Archive

Crossref

A Brief Tour through Provenance in Scientific Workflows and Databases

Author: Bertram Ludäscher
Publication date: 03/03/2016

Within computer science, the term provenance has multiple meanings, due to different motivations, perspectives, and assumptions prevalent in the respective communities. This chapter provides a high-level “sightseeing tour” of some of those different notions and uses of provenance in scientific workflows and databases.

Illinois Digital Environment for Access to Learning and Scholarship Repository

On the Reusability of Data Cleaning Workflows

Author: Li Lan
Ludäscher Bertram
Publication venue: University of Edinburgh
Publication date: 15/06/2022

The goal of data cleaning is to make data fit for purpose, i.e., to improve data quality through updates and data transformations, such that downstream analyses can be conducted and lead to trustworthy results. A transparent and reusable data cleaning workflow can save time and effort through automation, and make subsequent data cleaning on new data less error-prone. However, reusability of data cleaning workflows has received little to no attention in the research community. We identify some challenges and opportunities for reusing data cleaning workflows. We present a high-level conceptual model to clarify what we mean by reusability and propose ways to improve reusability along different dimensions. We use the opportunity of presenting at IDCC to invite the community to share their use cases, experiences, and desiderata for the reuse of data cleaning workflows and recipes in order to foster new collaborations and guide future work.

ZENODO

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

International Journal of Digital Curation

Games and Argumentation: Time for a Family Reunion!

Author: Ludäscher Bertram
Xia Yilin
Publication date: 12/09/2023

The rule "defeated(X) $\leftarrow$ attacks(Y,X), $\neg$ defeated(Y)" states that an argument is defeated if it is attacked by an argument that is not defeated. The rule "win(X) $\leftarrow$ move(X,Y), $\neg$ win(Y)" states that in a game a position is won if there is a move to a position that is not won. Both logic rules can be seen as close relatives (even identical twins) and both rules have been at the center of attention at various times in different communities: The first rule lies at the core of argumentation frameworks and has spawned a large family of models and semantics of abstract argumentation. The second rule has played a key role in the quest to find the "right" semantics for logic programs with recursion through negation, and has given rise to the stable and well-founded semantics. Both semantics have been widely studied by the logic programming and nonmonotonic reasoning community. The second rule has also received much attention by the database and finite model theory community, e.g., when studying the expressive power of query languages and fixpoint logics. Although close connections between argumentation frameworks, logic programming, and dialogue games have been known for a long time, the overlap and cross-fertilization between the communities appears to be smaller than one might expect. To this end, we recall some of the key results from database theory in which the win-move query has played a central role, e.g., on normal forms and expressive power of query languages. We introduce some notions that naturally emerge from games and that may provide new perspectives and research opportunities for argumentation frameworks. We discuss how solved query evaluation games reveal how- and why-not provenance of query answers. These techniques can be used to explain how results were derived via the given query, game, or argumentation framework. Comment: Fourth Workshop on Explainable Logic-Based Knowledge Representation (XLoKR), Sept 2, 2023. Rhodes, Greece
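The win-move rule quoted in this abstract can be made concrete with a small solver. The following is a minimal sketch, not the authors' implementation: it assumes a finite move graph given as (X, Y) pairs and labels each position by backward induction, with "won", "lost", and "drawn" corresponding to true, false, and undefined in the well-founded model. The function name `solve_win_move` and the example graph are hypothetical.

```python
from collections import defaultdict

def solve_win_move(moves):
    """Label positions for win(X) <- move(X,Y), not win(Y).

    moves: iterable of (x, y) pairs meaning "there is a move from x to y".
    Returns a dict mapping each position to 'won', 'lost', or 'drawn'.
    """
    positions = set()
    succ = defaultdict(set)   # outgoing moves
    pred = defaultdict(set)   # incoming moves
    for x, y in moves:
        positions.update((x, y))
        succ[x].add(y)
        pred[y].add(x)

    label = {}
    # For each position, count successors not yet known to be won.
    remaining = {p: len(succ[p]) for p in positions}
    # A position with no moves is lost.
    frontier = [p for p in positions if remaining[p] == 0]
    for p in frontier:
        label[p] = "lost"

    while frontier:
        p = frontier.pop()
        for q in pred[p]:
            if q in label:
                continue
            if label[p] == "lost":
                # q can move to a lost position, so q is won.
                label[q] = "won"
                frontier.append(q)
            else:
                # p is won; q is lost once all of its successors are won.
                remaining[q] -= 1
                if remaining[q] == 0:
                    label[q] = "lost"
                    frontier.append(q)

    # Positions never resolved (e.g. on cycles) are drawn.
    for p in positions:
        label.setdefault(p, "drawn")
    return label

# Hypothetical game: a -> b -> c, plus a two-cycle d <-> e.
example = solve_win_move([("a", "b"), ("b", "c"), ("d", "e"), ("e", "d")])
```

Under the same reading, replacing `move` with `attacks` edges (with the direction reversed, since the first rule uses attacks(Y,X)) would label arguments instead of positions.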

arXiv.org e-Print Archive

Automatic Module Detection in Data Cleaning Workflows: Enabling Transparency and Recipe Reuse

Author: Li Lan
Ludäscher Bertram
Parulian Nikolaus
Publication venue: 'Edinburgh University Library'
Publication date: 21/04/2021

Before data from multiple sources can be analyzed, data cleaning workflows (“recipes”) usually need to be employed to improve data quality. We identify a number of technical problems that make application of FAIR principles to data cleaning recipes challenging. We then demonstrate how transparency and reusability of recipes can be improved by analyzing dataflow dependencies within recipes. In particular, column-level dependencies can be used to automatically detect independent subworkflows, which can then be reused individually as data cleaning modules. We have implemented a prototype of this approach as part of an ongoing project to develop open-source companion tools for OpenRefine. Keywords: Data Cleaning, Provenance, Workflow Analysis
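The idea of detecting independent subworkflows from column-level dependencies can be illustrated as follows. This is a sketch under stated assumptions, not the authors' tool: a recipe is assumed to be a list of (step name, columns read, columns written) triples, every step touches at least one column, and two steps belong to the same module if they share a column directly or transitively. The function name `independent_modules` and the sample recipe are hypothetical.

```python
def independent_modules(steps):
    """Group recipe steps into modules that share no columns.

    steps: list of (name, columns_read, columns_written) triples.
    Returns a list of modules, each a list of step names in recipe order.
    """
    parent = {}  # union-find forest over column names

    def find(col):
        parent.setdefault(col, col)
        while parent[col] != col:
            parent[col] = parent[parent[col]]  # path halving
            col = parent[col]
        return col

    def union(a, b):
        parent[find(a)] = find(b)

    # Columns touched by the same step end up in the same component.
    for _, reads, writes in steps:
        cols = list(reads) + list(writes)
        for col in cols[1:]:
            union(cols[0], col)

    # Steps are grouped by the component of any column they touch.
    modules = {}
    for name, reads, writes in steps:
        cols = list(reads) + list(writes)
        modules.setdefault(find(cols[0]), []).append(name)
    return list(modules.values())

# Hypothetical recipe with two independent column groups.
recipe = [
    ("trim_name", ["name"], ["name"]),
    ("parse_date", ["date"], ["date"]),
    ("split_name", ["name"], ["first", "last"]),
    ("year_from_date", ["date"], ["year"]),
]
modules = independent_modules(recipe)
```

Each resulting module keeps its steps in the original order, so it could in principle be replayed on its own against a dataset that has the required columns.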

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

International Journal of Digital Curation

Exploring Geopolitical Realities through Taxonomies: The Case of Taiwan

Author: Cheng Yi-Yun
Ludäscher Bertram
Publication venue: North American Symposium on Knowledge Organization (NASKO)
Publication date: 01/06/2019

In the face of heterogeneous standards and large-scale datasets, it has become increasingly difficult to understand the underlying knowledge structures within complex information systems. These structures may encode latent assumptions that could be susceptible to issues such as ghettoization, bias, erasure, or omission. Inspired by a series of current events in the China-Taiwan conflict on the sovereignty of Taiwan, our research aims to develop methods that can elucidate multiple, often conflicting perspectives and hidden assumptions. We propose the use of a logic-based taxonomy alignment approach to first align and then reconcile distinct but overlapping taxonomies. We specifically examine three relevant taxonomies that list world entities: (1) ISO 3166 for country codes and subdivisions; (2) the geographic regions of the US Department of Homeland Security; (3) the Central Intelligence Agency’s World Factbook. Our results highlight multiple alternate views (or Possible Worlds) for situating Taiwan relative to other neighboring entities. We hope that this work can be a first step to demonstrate how different geopolitical perspectives can be represented using multiple, interrelated taxonomies.

Illinois Digital Environment for Access to Learning and Scholarship Repository

University of Washington: ResearchWorks Journal Hosting

Full of beans: a study on the alignment of two flowering plants classification systems

Author: Cheng Yi-Yun
Ludäscher Bertram
Publication venue: Networked Knowledge Organization Systems (NKOS)
Publication date: 01/01/2018

Advancements in technologies such as DNA analysis have given rise to new ways of organizing organisms in biodiversity classification systems. In this paper, we examine the feasibility of aligning two classification systems for flowering plants using a logic-based, Region Connection Calculus (RCC-5) approach. The older “Cronquist system” (1981) classifies plants using their morphological features, while the more recent Angiosperm Phylogeny Group IV (APG IV) (2016) system classifies based on many new methods including genome-level analysis. In our approach, we align pairwise concepts X and Y from two taxonomies using five basic set relations: congruence (X=Y), inclusion (X>Y), inverse inclusion (X<Y), overlap (X><Y), and disjointness (X!Y). With some of the RCC-5 relationships among the Fabaceae family (beans family) and the Sapindaceae family (maple family) uncertain, we anticipate that the merging of the two classification systems will lead to numerous merged solutions, so-called possible worlds. Our research demonstrates how logic-based alignment with ambiguities can lead to multiple merged solutions, which would not have been feasible when aligning taxonomies, classifications, or other knowledge organization systems (KOS) manually. We believe that this work can introduce a novel approach for aligning KOS, where merged possible worlds can serve as a minimum viable product for engaging domain experts in the loop.
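The RCC-5 set relations named in this abstract can be illustrated with a short sketch. It assumes each concept is modeled as a non-empty set of members and that "inclusion" means X properly includes Y; the function name `rcc5` and the sample sets are hypothetical, and the paper's actual alignment tooling is not shown here.

```python
def rcc5(x, y):
    """Return the RCC-5 relation between two non-empty sets x and y."""
    x, y = set(x), set(y)
    if not x or not y:
        raise ValueError("RCC-5 regions are assumed to be non-empty")
    if x == y:
        return "congruence"
    if y < x:                 # y is a proper subset of x
        return "inclusion"
    if x < y:                 # x is a proper subset of y
        return "inverse inclusion"
    if x & y:                 # some shared members, neither contains the other
        return "overlap"
    return "disjointness"

# Hypothetical member sets standing in for taxonomic concepts.
relations = [
    rcc5({"p1", "p2"}, {"p1", "p2"}),
    rcc5({"p1", "p2", "p3"}, {"p1"}),
    rcc5({"p1"}, {"p1", "p2"}),
    rcc5({"p1", "p2"}, {"p2", "p3"}),
    rcc5({"p1"}, {"p2"}),
]
```

Exactly one of the five relations holds for any pair of non-empty sets. When the members of a concept are not fully known, several relations may remain possible, which is one way to read the multiple "possible worlds" the abstract describes.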

Illinois Digital Environment for Access to Learning and Scholarship Repository

Workflows and Provenance: Toward Information Science Solutions for the Natural Sciences

Author: Gryk Michael R.
Ludäscher Bertram
Publication venue: 'Project Muse'
Publication date: 01/01/2017

The era of big data and ubiquitous computation has brought with it concerns about ensuring reproducibility in this new research environment. It is easy to assume that computational methods self-document by their very nature of being exact, deterministic processes. However, similar to laboratory experiments, ensuring reproducibility in the computational realm requires the documentation of both the protocols used (workflows) and a detailed description of the computational environment: algorithms, implementations, software environments, and the data ingested and execution logs of the computation. These two aspects of computational reproducibility (workflows and execution details) are discussed within the context of biomolecular Nuclear Magnetic Resonance spectroscopy (bioNMR), as well as the PRIMAD model for computational reproducibility.

Illinois Digital Environment for Access to Learning and Scholarship Repository